Automatic Speech Recognition Systems
Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts
Park, ChaeHun, Cho, Hojun, Choo, Jaegul
This paper explores integrating Automatic Speech Recognition (ASR) into natural language query systems to improve weather forecasting efficiency for Korean meteorologists. We address challenges in developing ASR systems for the Korean weather domain, specifically specialized vocabulary and Korean linguistic intricacies. To tackle these issues, we constructed an evaluation dataset of spoken queries recorded by native Korean speakers. Using this dataset, we assessed various configurations of a multilingual ASR model family, identifying performance limitations related to domain-specific terminology. We then implemented a simple text-to-speech-based data augmentation method, which improved the recognition of specialized terms while maintaining general-domain performance. Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. We believe our work provides a foundation for future advancements in ASR for the Korean weather forecasting domain.
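The TTS-based augmentation described above can be sketched roughly as follows. Everything here is an illustrative assumption: the English placeholder terms, the carrier templates, and the `tts_fn` stub (the paper works with Korean meteorological vocabulary and a real TTS engine, whose details are not given here).

```python
import numpy as np

def augment_with_tts(domain_terms, template_sentences, tts_fn, sample_rate=16000):
    """Build synthetic (audio, transcript) training pairs for domain terms.

    A minimal sketch of TTS-based data augmentation: each specialized term
    is inserted into carrier sentences, synthesized with a TTS engine, and
    paired with its transcript. `tts_fn` is a placeholder for any backend
    that maps text to a waveform (numpy array).
    """
    pairs = []
    for term in domain_terms:
        for template in template_sentences:
            text = template.format(term=term)
            audio = tts_fn(text)  # synthesize speech for the transcript
            pairs.append({"audio": audio, "text": text, "sr": sample_rate})
    return pairs

# Dummy TTS stand-in: 0.5 s of silence per request (real use: any TTS engine).
dummy_tts = lambda text: np.zeros(8000, dtype=np.float32)

augmented = augment_with_tts(
    domain_terms=["isobar", "geopotential height"],
    template_sentences=["Show me the {term} chart.", "What is the current {term}?"],
    tts_fn=dummy_tts,
)
print(len(augmented))  # 2 terms x 2 templates = 4 synthetic pairs
```

The synthetic pairs would then simply be appended to the real training set, which is why general-domain performance can be preserved.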
Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer
Jin, Weifei, Cao, Yuxin, Su, Junjie, Shen, Qi, Ye, Kai, Wang, Derui, Hao, Jie, Liu, Ziyao
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafted adversarial perturbations enable the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) audio. However, style modifications driven by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of a Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
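The $\ell_p$-constrained perturbation attacks this paper contrasts itself with can be illustrated on a toy, differentiable stand-in model. The linear "score", its weights, and all hyperparameters below are invented for illustration only; they are not the authors' method, merely the generic projected-gradient template such attacks follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an ASR model: a fixed linear score over audio samples.
# A higher score means the (hypothetical) target command is recognized.
w = rng.normal(size=512)

def target_score(audio):
    return float(w @ audio)

def linf_attack(audio, epsilon=0.01, steps=50, alpha=0.002):
    """Projected gradient ascent under an l-infinity norm budget.

    Each step moves the audio along the sign of the score's gradient, then
    projects back into the epsilon-ball around the original signal. For the
    linear toy model the gradient of the score is simply w.
    """
    adv = audio.copy()
    for _ in range(steps):
        adv = adv + alpha * np.sign(w)                        # gradient-sign step
        adv = np.clip(adv, audio - epsilon, audio + epsilon)  # project into ball
    return adv

clean = rng.normal(scale=0.1, size=512)
adv = linf_attack(clean)
```

The result raises the target score while every sample stays within `epsilon` of the original, which is exactly the kind of bounded-but-audible artifact the style-transfer approach tries to avoid.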
A Comprehensive Study of the Current State-of-the-Art in Nepali Automatic Speech Recognition Systems
Ghimire, Rupak Raj, Bal, Bal Krishna, Poudyal, Prakash
In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). The primary objective of this survey is to conduct a comprehensive review of the work on Nepali Automatic Speech Recognition systems completed to date, explore the different datasets used, examine the technology utilized, and take account of the obstacles encountered in implementing Nepali ASR systems. In tandem with the global trend of ever-increasing research on speech recognition, the number of Nepali ASR-related projects is also growing. Nevertheless, the investigation of language and acoustic models for the Nepali language has not received adequate attention compared to languages that possess ample resources. In this context, we provide a framework as well as directions for future investigations.
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring
Sudarshan, Ankitha, Samuel, Vinay, Patwa, Parth, Amara, Ibtihel, Chadha, Aman
Automatic Speech Recognition (ASR) has attracted profound research interest. Recent breakthroughs have given ASR systems different prospects, such as faithfully transcribing spoken language, which is a pivotal advancement in building conversational agents. However, accurately discerning context-dependent words and phrases remains a persistent challenge. In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to deliver accurate transcriptions across a wide variety of vocabularies and speaking styles. Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models, integrating both language and acoustic modeling for better accuracy. We use a transformer-based model to rescore the word lattice, achieving a palpable reduction in Word Error Rate (WER). We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
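In its simplest n-best form, the rescoring idea above reduces to combining each hypothesis's acoustic score with a weighted external LM score and keeping the best total. The toy LM, the hypotheses, and all scores below are invented for illustration; the paper operates on full word lattices with a transformer LM.

```python
import math

def rescore(nbest, lm_score_fn, lm_weight=0.5):
    """Rescore n-best ASR hypotheses with an external language model.

    Combines each hypothesis's acoustic-model log-score with a weighted LM
    log-score and returns the best-scoring text. `lm_score_fn` stands in
    for any LM (e.g. a transformer) returning a sentence log-probability.
    """
    best_text, best_total = None, -math.inf
    for text, am_logp in nbest:
        total = am_logp + lm_weight * lm_score_fn(text)
        if total > best_total:
            best_text, best_total = text, total
    return best_text

# Toy LM: prefers the hypothesis containing a contextually likely phrase.
toy_lm = lambda s: 0.0 if "recognize speech" in s else -5.0

nbest = [
    ("wreck a nice beach", -1.0),   # slightly better acoustic score
    ("recognize speech", -1.5),
]
print(rescore(nbest, toy_lm))  # LM evidence flips the ranking
```

This is how context-dependent phrases can win even when their raw acoustic score is lower.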
Focus on Whisper, OpenAI's automatic speech recognition system - Actu IA
OpenAI recently released Whisper, a 1.6-billion-parameter AI model capable of transcribing and translating speech audio from 97 different languages, showing robust performance on a wide range of automatic speech recognition (ASR) tasks. The model, trained on 680,000 hours of audio data collected from the web, was soon published as open source on GitHub. Whisper uses a transformer encoder-decoder architecture: the input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and then passed through an encoder. Unlike most state-of-the-art ASR models, it has not been fine-tuned on a specific dataset; instead, it has been trained using weak supervision on a large-scale noisy dataset collected from the Internet. Although it did not beat models specialized for LibriSpeech, in zero-shot evaluations on a diverse set of datasets Whisper proved more robust, making 50% fewer errors than those models.
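The fixed-length front-end described above can be sketched as follows; this is a rough illustration of the 30-second windowing only, not Whisper's actual preprocessing code (the exact details, including the log-Mel step that follows, live in the openai/whisper implementation).

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk

def chunk_audio(audio):
    """Split audio into fixed 30-second windows, zero-padding the last one.

    Whisper-style models consume fixed-length 30 s segments, each of which
    is later converted to a log-Mel spectrogram before the encoder.
    """
    chunks = []
    for start in range(0, max(len(audio), 1), N_SAMPLES):
        chunk = audio[start:start + N_SAMPLES]
        if len(chunk) < N_SAMPLES:
            chunk = np.pad(chunk, (0, N_SAMPLES - len(chunk)))  # pad with silence
        chunks.append(chunk)
    return chunks

# 70 s of audio -> three 30 s chunks (the last one padded with silence).
chunks = chunk_audio(np.zeros(70 * SAMPLE_RATE, dtype=np.float32))
print(len(chunks))  # 3
```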
Automatic Speech Recognition of Low-Resource Languages Based on Chukchi
Safonova, Anastasia, Yudina, Tatiana, Nadimanov, Emil, Davenport, Cydnie
The following paper presents a project focused on the research and creation of a new Automatic Speech Recognition (ASR) system based on the Chukchi language. There is no complete corpus of the Chukchi language, so most of the work consisted of collecting audio and texts in the Chukchi language from open sources and processing them. We managed to collect 21:34:23 hours of audio recordings and 112,719 sentences (or 2,068,273 words) of text in the Chukchi language. The XLSR model was trained on the obtained data and showed good results even with a small amount of data. Besides being a low-resource language, Chukchi is also polysynthetic, which significantly complicates any automatic processing. Thus, the usual WER metric for evaluating ASR becomes less indicative for a polysynthetic language. However, the CER metric showed good results. The question of metrics for polysynthetic languages remains open.
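The WER-versus-CER gap the authors describe is easy to see on a single long word: one wrong character makes the whole word wrong at the word level. The pseudo-polysynthetic word below is invented for illustration (it is not real Chukchi), and the metric definitions are the standard edit-distance ones.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match (in that order)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

# One long polysynthetic-style word with a single wrong character:
ref = "ganymylqorawetgawyrkyn"
hyp = "ganymylqorawetgavyrkyn"
print(wer(ref, hyp), round(cer(ref, hyp), 3))  # 1.0 0.045
```

A single substitution yields a WER of 100% but a CER of under 5%, which is why CER is the more indicative metric here.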
An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning
An independent, automated method of decoding and transcribing oral speech is known as automatic speech recognition (ASR). A typical ASR system extracts features from audio recordings or streams and runs one or more algorithms to map the features to corresponding texts. A great deal of research has been done in the field of speech signal processing in recent years. When given adequate resources, both conventional ASR and emerging end-to-end (E2E) speech recognition have produced promising results. However, for low-resource languages like Bengali, the current state of ASR lags behind, even though the language is spoken by over 500 million people all over the world. Despite its popularity, there aren't many diverse open-source datasets available, which makes it difficult to conduct research on Bengali speech recognition systems. This paper is part of the competition named `BUET CSE Fest DL Sprint'. Its purpose is to improve the speech recognition performance of the Bengali language by adopting speech recognition technology with an E2E structure based on the transfer learning framework. The proposed method effectively models the Bengali language and achieves a score of 3.819 in `Levenshtein Mean Distance' on the test dataset of 7,747 samples, when only 1,000 samples of the training dataset were used for training.
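The "Levenshtein Mean Distance" metric presumably averages the character-level edit distance between each reference and hypothesis over the test set; a minimal sketch under that assumption follows (the competition's exact definition, e.g. any normalization, may differ, and the example strings are illustrative).

```python
def levenshtein(a, b):
    """Classic DP Levenshtein distance (insert/delete/substitute, cost 1)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[m][n]

def mean_levenshtein(refs, hyps):
    """Average edit distance over a test set of (reference, hypothesis) pairs."""
    return sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)

refs = ["kitten", "flaw"]
hyps = ["sitting", "lawn"]
print(mean_levenshtein(refs, hyps))  # (3 + 2) / 2 = 2.5
```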
How To Fool an Eavesdropping AI … With Another AI
Scientists at Columbia University in New York City think they've devised an AI that can effectively prevent an eavesdropping automatic speech recognition system from transcribing your private conversation. So in the future, you may not have to worry that someone is using spyware to record your phone calls, or that your Alexa is listening in when it shouldn't be. Their Neural Voice Camouflage system prevents eavesdroppers from secretly transcribing your audio conversation by piggybacking a custom static-type noise over your speech. The noise is set to the same volume as normal background noise--no louder than a regular background air conditioning unit--so people you're talking to can still easily make out what you're saying. However, the automatic speech recognition (ASR) system that's attempting to eavesdrop will get confused and produce a gobbledygook transcription. This process of producing a custom background noise is more complicated than it seems.
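The "same volume as normal background noise" idea amounts to mixing the masking signal at a level matched to the speech. The sketch below illustrates only that RMS level-matching step, with random noise and invented parameters; the actual Neural Voice Camouflage system *predicts* the noise with a neural network rather than using random static.

```python
import numpy as np

def mix_at_level(speech, noise, level_db=0.0):
    """Overlay noise on speech, scaled to a target level relative to speech.

    Scales the noise so its RMS sits `level_db` decibels relative to the
    speech RMS (0 dB = equal loudness), then adds it to the signal.
    """
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    gain = (rms(speech) / rms(noise)) * 10 ** (level_db / 20)
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = rng.normal(scale=0.2, size=16000)   # 1 s of stand-in "speech"
noise = rng.normal(scale=1.0, size=16000)    # stand-in masking noise
mixed = mix_at_level(speech, noise, level_db=0.0)
```

At 0 dB the added noise carries exactly the same energy as the speech; a negative `level_db` (e.g. -10) would make it quieter, like distant air conditioning.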
Stopping Smart Devices From Spying on You - Neuroscience News
Summary: Researchers have developed a new AI algorithm that prevents smart devices such as Alexa or Siri from correctly hearing your words 80% of the time. The algorithm is a step toward giving people agency to protect the privacy of their voice in the presence of smart devices. Ever noticed online ads following you that are eerily close to something you've recently talked about with your friends and family? Microphones are embedded into nearly everything today, from our phones, watches, and televisions to voice assistants, and they are always listening to you. Computers are constantly using neural networks and AI to process your speech, in order to gain information about you.
Stopping 'them' from spying on you: New AI can block rogue microphones
If you wanted to prevent this from happening, how could you go about it? Back in the day, as portrayed in the hit TV show "The Americans," you would play music with the volume way up or turn on the water in the bathroom.